Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory

نویسندگان

  • Gaurav Suri
  • Bob Janssens
  • W. Kent Fuchs
چکیده

Rollback techniques that use message logging and deterministic replay can be used in parallel systems to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot directly apply message-passing logging techniques because they use inherently nondeterministic asynchronous communication. This paper presents new logging schemes that reduce the typically high overhead for logging in DSM. Our algorithm for sequentially consistent systems tracks rather than logs accesses to shared memory. In an extension of this method to lazy release consistency, the per-access overhead of tracking has been completely eliminated. Measurements with parallel applications show a significant reduction in failure-free overhead.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Improved Logging and Checkpointing Scheme for Recoverable Distributed Shared Memory

The distributed shared memory(DSM) system transforms an existing network of workstations to a powerful shared-memory parallel computer which could deliver superior price/performance. However, with more workstations engaged in the system and longer execution time, the probability of faults increases which could render the system useless. Several checkpointing and logging schemes have been propos...

متن کامل

Ensuring Correct Rollback Recovery in Distributed Shared Memory Systems

Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing met...

متن کامل

A New Log-Based Approach to Independent Recovery in Distributed Shraed Memory Systems

This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems. We propose a new log-based recovery approach, which can tolerate multiple node failures. The recovery approach employs an independent checkpointing technique and a new logging scheme. The independent checkpointing technique periodically interrupts the execution of a node to save the node’s state...

متن کامل

An efficient causal logging scheme for recoverable distributed shared memory systems

This paper presents a causal logging scheme for the lazy release consistent distributed shared memory systems. Causal logging is a very attractive approach to provide the fault tolerance for the distributed systems, since it eliminates the need of stable logging. However, since inter-process dependency must causally be transferred with the normal messages, the excessive message overhead has bee...

متن کامل

Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories

We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. We present results for implementations of our fault-tolerant technique using simulations of both...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1995